Abstract: We show that the learning sample complexity
of a sigmoidal neural network constructed by Sontag (1992) 
required to achieve a given misclassification error under a 
fixed purely atomic distribution can grow arbitrarily fast: for 
any prescribed rate of growth there is an input distribution 
having this rate as the sample complexity, and the bound is 
asymptotically tight. The rate can be superexponential, a non- 
recursive function, etc. We further observe that Sontag's ANN 
is not Glivenko-Cantelli under any input distribution having 
a non-atomic part. 

Keywords: PAC learnability, fixed distribution learning, sample
complexity, infinite VC dimension, witness of irregularity,
Sontag's ANN, precompactness.



I. Introduction

We begin by quoting the first part of open problem
12.6 from Vidyasagar's book [11] (this problem already
appears in the original 1997 version).

"How can one reconcile the fact that in distribution-free 
learning, every learnable concept class is also "polyno- 
mially" learnable, whereas this might not be so in fixed- 
distribution learning? 

In the case of distribution-free learning of concept classes 
(...) there are only two possibilities: 

1. $\mathscr{C}$ has infinite VC-dimension, in which case $\mathscr{C}$ is not PAC
learnable at all.

2. $\mathscr{C}$ has finite VC-dimension, in which case $\mathscr{C}$ is not
only PAC learnable, but the sample complexity $m_0(\varepsilon,\delta)$
is $O(1/\varepsilon + \log(1/\delta))$. Let us call such a concept class
"polynomially learnable".

In other words, there is no "intermediate" possibility 
of a concept class being learnable, but having a sample 
complexity that is superpolynomial in $1/\varepsilon$.

In the case of fixed-distribution learning, the situation 
is not so clear. (...) Is there a concept class for which 
every algorithm would require a superpolynomial number of 
samples? The only known way of constructing such a concept
class would be to (...) attempt to construct a concept class
whose $\varepsilon$-covering number grows faster than any exponential
in $1/\varepsilon$. It would be interesting to know whether such a
concept class exists." 



In fact, the existence of a concept class whose sample 
complexity grows exponentially in $1/\varepsilon$ under a given fixed
input distribution was already shown in 1991 by Benedek 
and Itai [2] (Theorem 3.5). Their example consisted of all 
finite subsets of a domain. Later and independently, a rather 
more natural concept class with such properties (generated 
by a neural network) was constructed by Barbara Hammer 
in her Ph.D. thesis [5] (Example 4.4.3 on page 77), cf. also [?].

Here we somewhat strengthen the above results and at 
the same time show that the phenomenon is quite common. 
Suppose that a concept class $\mathscr{C}$ satisfies a slightly stronger
property than having an infinite VC dimension, namely: $\mathscr{C}$
shatters every finite subset of an infinite set. Fix a sequence
$\varepsilon_k$ of desired values of learning precision, converging to
zero, and let $f$ be an increasing real function on $[0, +\infty)$.
Then one can find a probability measure $\mu$ on the domain
$\Omega$ of $\mathscr{C}$ with the property that $\mathscr{C}$ is PAC learnable under $\mu$,
but the sample complexity of learning to precision $\varepsilon_k$,
$k = 1, 2, 3, \ldots$, is growing as $\Omega(f(\varepsilon_k^{-1}))$. The prescribed rate of
growth can be ridiculously high, for instance, a non-recursive
function. The bound is essentially tight. For example, a well-known
sigmoidal feed-forward neural network of infinite VC
dimension constructed by Sontag [8] has this property.

This naturally brings up the question of the behaviour of
Sontag's network $\mathcal{N}$ under non-atomic input distributions.
It follows from Talagrand's theory of witness of irregularity
[9], [10] that $\mathcal{N}$ is not Glivenko-Cantelli with regard to any
measure having a non-atomic part. We do not know if a similar
property holds for PAC learnability, although it is easy to
see non-learnability of $\mathcal{N}$ for some common measures (the
uniform distribution on the interval, the Gaussian measure).
While discussing a relationship between Glivenko-Cantelli 
property, PAC learnability, and precompactness, we give an 
answer to another (minor) question of Vidyasagar. 

Note that we find it instructive to present the above 
observations in the reverse order. In Conclusion, we suggest 
a few open problems and a conjecture supported by the 
results of this note which might shed light on Vidyasagar's 
problem. 



II. Glivenko-Cantelli classes and learnability 

A. PAC learnability and total boundedness 

Benedek and Itai [2] proved that a concept class $\mathscr{C}$
is PAC learnable under a single probability distribution $\mu$ if
and only if $\mathscr{C}$ is totally bounded in the $L^1(\mu)$-distance. Here
we recall their results.

Theorem 2.1 (Theorem 4.8 in [2]; Theorem 6.3 in [11]):
Suppose $\mathscr{C}$ is a concept class, $\varepsilon > 0$, and that $B_1, \ldots, B_k$
is an $\varepsilon/2$-cover for $\mathscr{C}$. Then the minimal empirical
risk algorithm is PAC to accuracy $\varepsilon$. In particular, the
sample complexity of PAC learning $\mathscr{C}$ to accuracy $\varepsilon$ with
confidence $1 - \delta$ is

$$m \le \frac{32}{\varepsilon} \log\frac{k}{\delta}.$$
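As a rough numerical illustration of Theorem 2.1, the following sketch evaluates the bound $(32/\varepsilon)\log(k/\delta)$ for a given cover size. The function name `erm_sample_bound` is hypothetical, and reading "log" as the natural logarithm is an assumption made here, not something the statement fixes.

```python
import math


def erm_sample_bound(eps, delta, cover_size):
    """Evaluate (32 / eps) * log(k / delta), rounded up to an integer.

    eps        -- target accuracy, eps > 0
    delta      -- allowed failure probability, 0 < delta < 1
    cover_size -- number k of elements in an eps/2-cover
    The natural logarithm is assumed for "log".
    """
    if eps <= 0 or not (0 < delta < 1) or cover_size < 1:
        raise ValueError("need eps > 0, 0 < delta < 1, cover_size >= 1")
    return math.ceil(32.0 / eps * math.log(cover_size / delta))


if __name__ == "__main__":
    # The bound is linear in 1/eps but only logarithmic in the cover size k.
    for k in (10, 1000, 10 ** 6):
        print(k, erm_sample_bound(0.1, 0.05, k))
```

The logarithmic dependence on $k$ is what makes covering numbers the relevant quantity: a cover of size exponential in some rate translates into a sample bound proportional to that rate.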

Recall that a subset $A$ of a metric space $X$ is $\varepsilon$-separated,
or $\varepsilon$-discrete, if, whenever $a, b \in A$ and $a \ne b$, one has
$d(a, b) > \varepsilon > 0$. The largest cardinality of an $\varepsilon$-discrete
subset of $X$ is the $\varepsilon$-packing number of $X$. For example, the
following lemma estimates from below the packing number 
of the Hamming cube. 

Lemma 2.2 ([11], Lemma 7.2 on p. 279): Let $0 < \varepsilon < 1/4$.
The Hamming cube $\{0, 1\}^n$, equipped with the normalized
Hamming distance

$$d_h(x, y) = \frac{1}{n}\,|\{i : x_i \ne y_i\}|,$$

admits a family of elements which are pairwise at a distance
of at least $2\varepsilon$ from each other of cardinality at least
$\exp[2(0.5 - 2\varepsilon)^2 n]$.

The following is a source of lower bounds on the sample 
complexity. 

Theorem 2.3 (Lemma 4.8 in [2]; Theorem 6.6 in [11]):
Suppose $\mathscr{C}$ is a given concept class, and let $\varepsilon > 0$ be
specified. Then any algorithm that is PAC to accuracy
$\varepsilon$ requires at least $\lg M(2\varepsilon, \mathscr{C}, L^1(\mu))$ samples, where
$M(2\varepsilon, \mathscr{C}, L^1(\mu))$ denotes the $2\varepsilon$-packing number of the
concept class $\mathscr{C}$ with regard to the $L^1(\mu)$-distance.
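The way Lemma 2.2 and Theorem 2.3 combine into a lower bound can be sketched numerically. The helper names below are hypothetical, the exponent is taken to be $2(0.5 - 2\varepsilon)^2$ as in Lemma 2.2, and "lg" is assumed to mean the base-2 logarithm.

```python
import math


def hamming_packing_exponent(eps):
    """Constant c such that the cube {0,1}^n contains at least exp(c * n)
    points pairwise at normalized Hamming distance >= 2 * eps (Lemma 2.2)."""
    if not 0 < eps < 0.25:
        raise ValueError("Lemma 2.2 requires 0 < eps < 1/4")
    return 2.0 * (0.5 - 2.0 * eps) ** 2


def sample_lower_bound(packing_number):
    """Lower bound lg M of Theorem 2.3, with lg assumed to be log base 2."""
    return math.log2(packing_number)


if __name__ == "__main__":
    n = 10_000
    c = hamming_packing_exponent(0.21)
    # lg(exp(c * n)) = c * n / ln 2, which is at least c * n.
    print(c, c * n / math.log(2))
```

With $\varepsilon = 0.21$ the exponent equals $0.0128$, so a shattered set of $n$ points already forces a number of samples linear in $n$.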

For the most comprehensive presentation of PAC learnability
under a single distribution, see [11], Ch. 6.

B. Glivenko-Cantelli classes 

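A function class $\mathscr{F}$ on a domain $\Omega$ (a standard Borel
space) is Glivenko-Cantelli with regard to a probability
distribution $\mu$ ([3], Ch. 3), or else has the property of uniform
convergence of empirical means (UCEM property) [11], if
for each $\varepsilon > 0$

$$\sup_{\mu \in \mathcal{P}} \mu^{\otimes n}\left\{ \sup_{f \in \mathscr{F}} \left| E_{\mu}(f) - E_{\mu_n}(f) \right| \ge \varepsilon \right\} \to 0 \quad \text{as } n \to \infty. \qquad (1)$$

Here $\mathcal{P}$ is the family of distributions under consideration
(for a single distribution, $\mathcal{P} = \{\mu\}$), $\mu^{\otimes n}$ is the product
measure on $\Omega^n$, and $\mu_n$ stands for the empirical (uniform)
measure on $n$ points, sampled from the domain in an i.i.d. fashion.
We assume that the functions in $\mathscr{F}$ take values in an interval
(i.e., that $\mathscr{F}$ is uniformly bounded). The notion applies to
neural networks as well, if $\mathscr{F}$ denotes the family of output
functions corresponding to all possible values of learning
parameters.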

Every Glivenko-Cantelli class $\mathscr{F}$ is PAC learnable, which
explains the important role of this notion. In fact, every
consistent learning rule $L$ will learn $\mathscr{F}$. We find it instructive
to give a different proof, replying in passing to a remark of
Vidyasagar [11], p. 241. After proving that every Glivenko-Cantelli
concept class $\mathscr{C}$ with regard to a fixed measure $\mu$
is precompact with regard to the $L^1(\mu)$-distance, the author
remarks that his proof is both indirect (Glivenko-Cantelli
$\Rightarrow$ PAC learnable $\Rightarrow$ precompact) and does not extend to
function classes, so it is not known to the author whether
the result holds if $\mathscr{C}$ is replaced with a function class $\mathscr{F}$.

The answer is yes, as is (implicitly) stated in [10] (p.
379, the beginning of the proof of Proposition 2.5), but that
deduction is also rather roundabout (proving first the absence
of a witness of irregularity). In fact, the result is really very
simple.

Observation 2.4: Every (uniformly bounded) Glivenko-Cantelli
function class $\mathscr{F}$ with regard to a fixed probability
measure $\mu$ is precompact in the $L^1(\mu)$-distance.

Proof: If $\mathscr{F}$ is not precompact, then for some $\varepsilon_0 > 0$
it contains an infinite $\varepsilon_0$-discrete subfamily $\mathscr{F}'$. For every
finite sample $\sigma \in \Omega^n$ there is a further infinite subfamily
$\mathscr{F}'' \subseteq \mathscr{F}'$ of functions whose restrictions to $\sigma$ are at
a pairwise $L^1(\mu_n)$-distance $< \varepsilon_0/2$ from each other (the
pigeonhole principle coupled with the fact that the restriction
of $\mathscr{F}$ to $\sigma$ is $L^1(\mu_n)$-precompact). This means that the $\mu$- and
$\mu_n$-expectations of some function of the form $|f_1 - f_2|$,
$f_i \in \mathscr{F}$, $i = 1, 2$, differ by at least
$\varepsilon_0/2$, and for at least one of $i \in \{1, 2\}$,

$$\left| E_{\mu}(f_i) - E_{\mu_n}(f_i) \right| \ge \varepsilon_0/4$$

(an application of the triangle inequality in $\mathbb{R}$). Since the
latter is true for every sample, no matter the size, $\mathscr{F}$ is not
Glivenko-Cantelli. ■

In fact, the same proof works in a slightly more general 
case when $\mathscr{F}$ is uniformly bounded by a single function (not
necessarily integrable). 

This gives an alternative deduction of the implication 
Glivenko-Cantelli => PAC learnability. Admittedly, the re- 
sult obtained is somewhat weaker, as this way we do not get 
consistent learnability. 

C. Talagrand's witness of irregularity 

Talagrand [9], [10] characterized uniform Glivenko-Cantelli
function classes with regard to a single distribution
in terms of shattering. We recall his main result for
concept classes only. Let $\Omega$ be a measurable space, let $\mathscr{C}$ be
a concept class on $\Omega$, and let $\mu$ be a probability measure on
$\Omega$. A measurable subset $A \subseteq \Omega$ is a witness of irregularity
of $\mathscr{C}$ if $\mu(A) > 0$ and for every $n$ the set of all $n$-tuples
of elements of $A$ shattered by $\mathscr{C}$ has full measure in $A^n$.
In other words, $\mu$-almost all $n$-tuples of elements of $A$ are
shattered by $\mathscr{C}$.

Theorem 2.5 (Talagrand [9], Th. 2): A concept class $\mathscr{C}$
is Glivenko-Cantelli with regard to the probability measure
$\mu$ if and only if $\mathscr{C}$ admits no witness of irregularity.

Let $\mu$ be a probability measure on $\Omega$. Recall that a set $A$
is an atom if for every measurable $B \subseteq A$ one has either
$\mu(B) = 0$ or $\mu(B) = \mu(A)$. The measure $\mu$ is non-atomic
if it contains no atoms, and purely atomic if the measures
of atoms add up to one. The restriction of $\mu$ to the union of
atoms is the atomic part of $\mu$.

Since a witness of irregularity can contain no atoms, the 
following is an immediate corollary of Talagrand's 1987 
result. 

Corollary 2.6: If a measure $\mu$ is purely atomic, then every
concept class $\mathscr{C}$ is uniform Glivenko-Cantelli with regard
to $\mu$, and in particular PAC learnable.

The corollary is easy to prove directly, without using
subtle results of Talagrand, and the result was observed
(independently) in 1991 and investigated in detail by Benedek
and Itai ([2], Theorem 3.2). Notice that the result does not
assert polynomial PAC learnability of $\mathscr{C}$, and we will see
shortly that the required sample complexity of $\mathscr{C}$ can grow
arbitrarily fast.

D. The neural network of Sontag 

Figure 1 recalls a well-known example of a sigmoidal
neural network $\mathcal{N}$ constructed by Sontag [8], pp. 34-36.
(Cf. also [11], page 389, from which the top diagram in Figure 1
is borrowed.) The activation sigmoid is of the form

$$\phi(x) = \frac{1}{\pi}\tan^{-1} x + \frac{\cos x}{\alpha(1 + x^2)} + \frac{1}{2},$$

where $\alpha > 2\pi$ is fixed, e.g. $\alpha = 100$, and the output-layer
perceptron has both input weights equal to one and a
threshold of one. The input-output function of the network
is given by

$$y = \eta[p(x)],$$

where

$$p(x) = \frac{2\cos wx}{\alpha(1 + w^2 x^2)}.$$

The input space of $\mathcal{N}$ is the space $\mathbb{R}$ of real numbers.
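The relation between the activation and the function $p$ can be checked numerically. The sketch below assumes the sigmoid $\phi(x) = \frac{1}{\pi}\tan^{-1}x + \frac{\cos x}{\alpha(1+x^2)} + \frac{1}{2}$, assumes that the two hidden units carry input weights $w$ and $-w$ (an assumption about the architecture, not stated in the text), and takes $\eta$ to be the Heaviside step with $\eta(t) = 1$ for $t > 0$ (also an assumed convention). All function names are hypothetical.

```python
import math

ALPHA = 100.0  # any fixed alpha > 2*pi; 100 is the example value in the text


def phi(x):
    """Assumed form of Sontag's activation sigmoid."""
    return math.atan(x) / math.pi + math.cos(x) / (ALPHA * (1.0 + x * x)) + 0.5


def p(x, w):
    """p(x) = 2 cos(w x) / (alpha (1 + w^2 x^2))."""
    return 2.0 * math.cos(w * x) / (ALPHA * (1.0 + (w * x) ** 2))


def hidden_layer_sum(x, w):
    """Sum of two hidden units with weights w and -w (assumed), output
    weights one, minus the output threshold of one."""
    return phi(w * x) + phi(-w * x) - 1.0


def network_output(x, w):
    """Binary output eta[p(x)], with eta(t) = 1 for t > 0 (assumed)."""
    return 1 if p(x, w) > 0 else 0


if __name__ == "__main__":
    for x in (0.0, 0.3, 1.7):
        print(x, hidden_layer_sum(x, 5.0), p(x, 5.0), network_output(x, 5.0))
```

Under these assumptions the arctangent terms cancel and the constants sum to the threshold, so only the oscillating term survives; the output is then determined by the sign of $\cos wx$.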

Recall that a collection $x_1, x_2, \ldots, x_n$ of real numbers is
rationally independent if no non-trivial linear combination
of $1, x_1, x_2, \ldots, x_n$ with rational coefficients vanishes.

Theorem 2.7 ([8], pp. 42-43): The Sontag network $\mathcal{N}$
shatters every rationally independent $n$-tuple of real inputs
$x_1, x_2, \ldots, x_n$.

In particular, the VC dimension of Sontag's network is
infinite. Besides, it is easy to find an infinite rationally
independent set, and so every finite subset of such a set
is shattered by $\mathcal{N}$. We will need this fact later.

Here is another extreme property of Sontag's network. 





Figure 1. Sontag's ANN architecture (top) and the activation sigmoid $\phi$
with $\alpha = 100$ (bottom).



Theorem 2.8: The neural network of Sontag $\mathcal{N}$ is
Glivenko-Cantelli under a probability distribution $\mu$ on the
inputs if and only if $\mu$ is purely atomic.

Proof: Sufficiency ($\Leftarrow$) follows from Corollary 2.6. Let
us prove necessity ($\Rightarrow$). By splitting $\mu$ into a purely atomic
part $\mu_a$ and a continuous part $\mu_c$, in view of Theorems 2.5
of Talagrand and 2.7 of Sontag, it suffices to prove that
for every non-atomic probability measure $\nu$ on $\mathbb{R}$ the set of
rationally independent $n$-tuples has a full $\nu^{\otimes n}$ measure in
$\mathbb{R}^n$: the support of $\mu_c$ will then be a witness of irregularity.
In its turn, this reduces to a proof that for a fixed collection
$(\lambda_1, \ldots, \lambda_{n+1})$ of rationals not all of which are zero, the
affine hyperplane

$$H_\lambda = \{x \in \mathbb{R}^n : \langle x, \lambda \rangle = \lambda_{n+1}\},$$

where $\lambda = (\lambda_1, \ldots, \lambda_n)$, has $\nu^{\otimes n}$-measure null. This is a
consequence of Eggleston's theorem [4]: If $A$ is a measurable,
Lebesgue-positive subset of the unit square, then there
is a measurable positive set $B$ and a perfect set $C$ such
that $B \times C$ is included in $A$. "Lebesgue measure on the
unit square" here is not a loss of generality, as every two
non-atomic standard Borel probability measure spaces are
isomorphic, and we obtain by induction that if $A \subseteq \mathbb{R}^n$ and
$\nu^{\otimes n}(A) > 0$, then $A$ contains a product of $n$ sets one of
which has positive $\nu$-measure and all the rest are perfect
(contain no isolated points). Clearly, no $(n-1)$-hyperplane
in $\mathbb{R}^n$ can have this property. ■

Figure 2. The function $p$ for $\alpha = 100$ and $w = 5$ (top) and the
corresponding output binary function (bottom).

Example 2.9: Sontag's ANN is not PAC learnable under
the uniform distribution on an interval.

Indeed, for the sequence of learning parameters $w_k = 2^k$
the corresponding output binary functions are at a pairwise
$L^1(\lambda)$-distance 1/2 from each other, where $\lambda$ is a uniform
distribution on some interval.

A similar argument works for the Gaussian distribution on
the inputs.
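The pairwise distance claimed in Example 2.9 can be estimated numerically. The sketch below assumes that the binary output for parameter $w$ equals 1 exactly when $\cos wx > 0$, and it takes the interval to be $[0, 2\pi]$; both the interval and the helper names are choices made for this illustration only.

```python
import math


def output(x, w):
    """Binary output of the network: 1 iff cos(w x) > 0 (assumed convention)."""
    return 1 if math.cos(w * x) > 0 else 0


def l1_distance(w1, w2, a=0.0, b=2.0 * math.pi, grid=100_000):
    """Midpoint-rule estimate of the L^1(lambda)-distance between the two
    output functions, where lambda is the uniform distribution on [a, b]."""
    step = (b - a) / grid
    differing = 0
    for i in range(grid):
        x = a + (i + 0.5) * step
        if output(x, w1) != output(x, w2):
            differing += 1
    return differing / grid


if __name__ == "__main__":
    for j in range(1, 4):
        for k in range(j + 1, 4):
            print(2 ** j, 2 ** k, l1_distance(2 ** j, 2 ** k))
```

On this interval the output functions for $w_k = 2^k$ behave like Rademacher functions, so any two of them disagree on half of the interval, and the family is an infinite $1/2$-separated set.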



However, we do not know if there exists a non-atomic 
measure under which Sontag's ANN is PAC learnable. 

E. Glivenko-Cantelli versus learnability 

Not every PAC learnable function class, or even concept class,
is Glivenko-Cantelli. Examples of such concept classes exist
trivially, e.g. the concept class consisting of all finite and
all cofinite subsets of the unit interval is PAC learnable
under every non-atomic distribution, yet clearly not uniform
Glivenko-Cantelli, cf. [?], p. 385, note (2), or [11], p. 230,
Example 6.4. A more interesting example, though based on
the same idea, is Example 6.6 in [11], p. 232. Here we
present such an example of a countable concept class.

Example 2.10: For $n \in \mathbb{N}$, say that intervals $[i/n, (i+1)/n]$,
$i = 0, 1, \ldots, n-1$, are of order $n$. Let $\mathscr{C}_n$ consist
of all unions of fewer than $\sqrt{n}$ intervals of order $n$, and set
$\mathscr{C} = \bigcup_{n=1}^{\infty} \mathscr{C}_n$. If now $k \in \mathbb{N}$ is arbitrary and
$x_1 < x_2 < \cdots < x_k$ are points of the unit interval, choose
$n > k^2$ so that $1/n$ is smaller than any of the half-distances
between neighbouring points $(x_{i+1} - x_i)/2$, $i = 1, 2, \ldots, k-1$.
Clearly, elements of $\mathscr{C}_n$ shatter the sample $\{x_1, x_2, \ldots, x_k\}$,
and so the entire interval is a witness of irregularity for the
concept class $\mathscr{C}$. By Talagrand's result, the class $\mathscr{C}$ is not
Glivenko-Cantelli. At the same time, for every $n$, $\mathscr{C}_n$ forms an
$n^{-1/2}$-net for $\mathscr{C}$ with regard to the $L^1(\lambda)$-distance, and so
$\mathscr{C}$ is PAC learnable under the Lebesgue measure $\lambda$ (the uniform
measure on the interval).
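The shattering step of Example 2.10 can be verified mechanically for a concrete sample. The sketch below uses exact rational arithmetic; the function names are hypothetical, and the choice of the smallest admissible $n$ is just one convenient option (any larger $n$ satisfying the two conditions would do).

```python
from fractions import Fraction
from itertools import combinations


def choose_order(points):
    """Smallest n with n > k^2 and 1/n below every half-gap between
    neighbouring points (requires at least two distinct points)."""
    pts = sorted(points)
    k = len(pts)
    min_half_gap = min((b - a) / 2 for a, b in zip(pts, pts[1:]))
    n = k * k + 1
    while Fraction(1, n) >= min_half_gap:
        n += 1
    return n


def concept_for(subset, n):
    """Indices i of order-n intervals [i/n, (i+1)/n], one per chosen point."""
    return {min(int(x * n), n - 1) for x in subset}


def contains(cells, x, n):
    """True if x lies in the union of the closed intervals indexed by cells."""
    return any(Fraction(i, n) <= x <= Fraction(i + 1, n) for i in cells)


def is_shattered(points):
    """Check that every subset of the sample is cut out by a union of
    fewer than sqrt(n) intervals of order n."""
    n = choose_order(points)
    for r in range(len(points) + 1):
        for subset in combinations(points, r):
            cells = concept_for(subset, n)
            if len(cells) ** 2 >= n:  # "fewer than sqrt(n)" intervals
                return False
            picked = {x for x in points if contains(cells, x, n)}
            if picked != set(subset):
                return False
    return True


if __name__ == "__main__":
    sample = [Fraction(1, 10), Fraction(7, 20), Fraction(3, 5), Fraction(9, 10)]
    print(choose_order(sample), is_shattered(sample))
```

The condition $n > k^2$ guarantees that at most $k < \sqrt{n}$ intervals are used, and the half-gap condition guarantees that an interval of length $1/n$ containing one sample point cannot reach another.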

Observe that, in fact, $\mathscr{C}$ fails the Glivenko-Cantelli property
with regard to every measure having a non-atomic part.
As we have seen, there exist non-atomic measures under
which $\mathscr{C}$ is PAC learnable. There are also measures under
which $\mathscr{C}$ is not PAC learnable, for example the Haar measure
$\nu$ on the Cantor set.

Recall the construction of the Cantor "middle third" set $C$
(Figure 3). This is the set left of the closed unit interval $[0, 1]$
after first deleting the middle third $(1/3, 2/3)$, then deleting
the middle thirds of the two remaining intervals, $(1/9, 2/9)$
and $(7/9, 8/9)$, and continuing to delete the middle thirds
ad infinitum. The elements of the Cantor set are exactly those
real numbers between 0 and 1 admitting a ternary expansion
not containing 1. Sometimes $C$ is called Cantor dust. The
complement to the Cantor set is a union of countably many
open intervals, all the middle thirds left out. The set $C_n$
left after the first $n$ steps of removing the middle thirds is
the union of $2^n$ closed intervals of equal length $3^{-n}$ each.
The Haar measure of every such interval is set to be equal
to $2^{-n}$, and this condition defines a non-atomic measure $\nu$
supported on $C$ in a unique way.
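The level sets $C_n$ are easy to generate explicitly. The following sketch, with hypothetical function names, builds the $2^n$ closed intervals of $C_n$ in exact rational arithmetic and records the mass $2^{-n}$ that the Haar measure assigns to each of them.

```python
from fractions import Fraction


def cantor_level(n):
    """Closed intervals making up C_n, as (left, right) pairs, left to right."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(n):
        refined = []
        for a, b in intervals:
            third = (b - a) / 3
            refined.append((a, a + third))   # keep the left third
            refined.append((b - third, b))   # keep the right third
        intervals = refined
    return intervals


def haar_mass_per_interval(n):
    """Mass given by the Haar measure to each interval of C_n."""
    return Fraction(1, 2 ** n)


if __name__ == "__main__":
    for a, b in cantor_level(2):
        print(a, b, haar_mass_per_interval(2))
```

At level 2 this produces the four intervals $[0, 1/9]$, $[2/9, 1/3]$, $[2/3, 7/9]$, $[8/9, 1]$, each of Lebesgue length $1/9$ but of Haar mass $1/4$.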

It is easy to see now that the closed intervals
$I_1, I_2, \ldots, I_{2^n}$ at the level $n$ are shattered by concepts
from $\mathscr{C}_N$ if $N$ is large enough ($N > 2^{2n}$), in the
following sense: for every set of indices $J \subseteq \{1, 2, \ldots, 2^n\}$
there is a $C \in \mathscr{C}_N$ which contains every interval $I_j$, $j \in J$,
and is disjoint from every interval $I_k$, where $k \notin J$. Now
one can modify the proof of Lemma 2.2 exactly as it was
done in [?], proof of Theorem 3, in order to conclude that
$\mathscr{C}$ is not totally bounded in the $L^1(\nu)$-distance.

Figure 3. Construction of the Cantor set, after n = 2 steps.

III. All rates of sample complexity are possible 

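Theorem 3.1: Let $\mathscr{C}$ be a concept class which shatters
every finite subset of some infinite set. Let $(\varepsilon_k)$, $\varepsilon_k \downarrow 0$,
be a sequence of positive reals converging to zero, and let
$f\colon \mathbb{R}_+ \to \mathbb{R}_+$ be a non-decreasing function growing at least
linearly: $f(x) = \Omega(x)$. Then there is a probability measure
$\mu = \mu((\varepsilon_k), f)$ on the input domain $\Omega$ with the property that
for every $\delta > 0$ and $k \in \mathbb{N}$ the class $\mathscr{C}$ is PAC learnable
under the distribution $\mu$ to accuracy $\varepsilon_k$, and the rate of
required sample complexity is at least

$$n(\varepsilon_k, \delta) = \Omega\left(f\left(\varepsilon_k^{-1}\right)\right). \qquad (2)$$

Moreover, the above estimate is essentially tight in the sense
that the sample complexity

$$n(\varepsilon_k, \delta) = O\left(f\left(\frac{1}{\varepsilon_k}\right)\log\left(\frac{1}{\delta}\right)\right) \qquad (3)$$

suffices to learn $\mathscr{C}$ to accuracy $4\varepsilon_k$ with confidence $1 - \delta$.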
Proof: We can assume without loss of generality that
$\varepsilon_1 = 1/5$. For every $k$, set $m_k = 5(\varepsilon_k - \varepsilon_{k+1})$. Then the $m_k$
form a sequence of non-negative reals which sums up to
one. Denote, for simplicity, $f_k = f(\varepsilon_k^{-1})$. Further, choose
pairwise disjoint finite sets $F_k$ of cardinality $|F_k| = f_k - f_{k-1}$
(where $f_0 = 0$) in a way that every union of finitely
many of the $F_k$'s is shattered by $\mathscr{C}$ (this is possible due to
the assumption on the class $\mathscr{C}$). Let $\mu_k$ denote a uniform
measure supported on $F_k$ of total mass $m_k$. Now set
$\mu = \sum_{k=1}^{\infty} \mu_k$. Since $\sum_{k=1}^{\infty} m_k = 1$, $\mu$ is a probability Borel
measure.

Let $k$ be arbitrary. Select any subset of $\mathscr{C}$ shattering
$\bigcup_{i=1}^{k} F_i$ and containing

$$\prod_{i=1}^{k} 2^{|F_i|} = 2^{f_k}$$

elements. This set forms a finite $\varepsilon_k$-net in $\mathscr{C}$ with regard
to the $L^1(\mu)$-distance. Since $\varepsilon_k \downarrow 0$, we use Theorem 2.1
to conclude: the class $\mathscr{C}$ is PAC learnable under $\mu$, and
the sample complexity of learning $\mathscr{C}$ to accuracy $\varepsilon_k$ and
confidence $1 - \delta$, $\delta > 0$, is bounded above as stated in (3).



For every $k$, Lemma 2.2, applied with $\varepsilon = 0.21$, guarantees
the existence of a subset $\mathscr{D}_k$ of $\mathscr{C}$ every two elements of
which are at an $L^1(\mu_k)$-distance $\ge 0.42\, m_k$ from each other,
and containing $\ge \exp[0.0128(f_k - f_{k-1})]$ elements. Let
$N$ be so large that $\sum_{i=1}^{N} m_i \ge (1.05)^{-1}$. Fix $k$. Since
$\bigcup_{i=1}^{k} F_i$ is shattered by $\mathscr{C}$, one can find elements of $\mathscr{C}$ which
correspond to elements of the product $\prod_{i=1}^{k} \mathscr{D}_i$, and every
two of which are at a distance $\ge 0.42 \left(\sum_{i=1}^{N} m_i\right) \varepsilon_k \ge 0.4\,\varepsilon_k$
from each other. According to Theorem 2.3 this means
that the sample complexity of learning $\mathscr{C}$ under $\mu$
to accuracy $\varepsilon_k$ with confidence $1 - \delta$ is at least $0.0128 f_k$
samples. ■
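The bookkeeping behind the measure $\mu$ in the proof can be sketched as follows. The function name `build_atomic_measure` is hypothetical, the masses are taken as $m_k = 5(\varepsilon_k - \varepsilon_{k+1})$ with $\varepsilon_1 = 1/5$ (the orientation that makes them non-negative), and $f$ is assumed to be integer-valued so that the block sizes $|F_k| = f_k - f_{k-1}$ are integers. Only finitely many blocks are built, so the masses sum to $1 - 5\varepsilon_{K+1}$ rather than 1.

```python
from fractions import Fraction


def build_atomic_measure(eps, f):
    """Return a list of (block_size, block_mass) pairs for F_1, ..., F_K.

    eps -- strictly decreasing list [eps_1, ..., eps_{K+1}] of Fractions
           with eps_1 == 1/5
    f   -- non-decreasing, integer-valued function (assumption)
    Each point of block k carries mass block_mass / block_size.
    """
    if eps[0] != Fraction(1, 5):
        raise ValueError("normalization eps_1 = 1/5 is required")
    if any(a <= b for a, b in zip(eps, eps[1:])):
        raise ValueError("eps must be strictly decreasing")
    blocks = []
    previous_f = 0
    for k in range(len(eps) - 1):
        mass = 5 * (eps[k] - eps[k + 1])   # m_k
        f_k = f(1 / eps[k])                # f(eps_k^{-1})
        blocks.append((f_k - previous_f, mass))
        previous_f = f_k
    return blocks


if __name__ == "__main__":
    # Example with eps_k = 1/(5k) and the exponential rate f(x) = 2^x.
    eps = [Fraction(1, 5 * k) for k in range(1, 6)]
    for size, mass in build_atomic_measure(eps, lambda x: 2 ** int(x)):
        print(size, mass)
```

In this example the first four blocks have sizes adding up to $2^{20}$ while carrying total mass $4/5$: the number of shattered atoms grows at the prescribed rate even though the tail mass shrinks only like $5\varepsilon_k$.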

Remark 3.2: The measure $\mu$ constructed in the proof is
purely atomic. However, by replacing the domain $\Omega$ with
$\Omega \times [0, 1]$, every concept $C \in \mathscr{C}$ with $C \times [0, 1]$, and $\mu$
with the product $\mu \otimes \lambda$, where $\lambda$ is the uniform (Lebesgue)
measure on the interval, one can "translate" every example
as above into an example of learning under a non-atomic
probability distribution.

Corollary 3.3: Let $\nu$ be a probability distribution on a
domain $\Omega$ having infinite support. Then there exist concept
classes $\mathscr{C}$ which are PAC learnable under $\nu$ and whose
required sample complexity is arbitrarily high.

Proof: The measure space $(\Omega, \nu)$ admits a measure-preserving
map $\varphi$ to the measure space constructed in the
proof of Theorem 3.1 in such a way that $\nu\varphi^{-1} = \mu$ (here
one uses the fact that $\mu$ is purely atomic). Now the concept
class $\mathscr{C}\varphi^{-1}$, consisting of all sets $\varphi^{-1}(C)$, has the same
learning properties under the distribution $\nu$ as the class $\mathscr{C}$
has under $\mu$. ■

Corollary 3.4: Let $\varepsilon_k \downarrow 0$ be a sequence of positive
values converging to zero, and let $f$ be a real function on
$[0, +\infty)$ growing at least linearly. Then there is a probability
distribution $\mu$ on the real numbers under which Sontag's
network $\mathcal{N}$ is PAC learnable to accuracy $\varepsilon_k$ with confidence
$1 - \delta$, requiring a sample of size $\Omega(f(\varepsilon_k^{-1}))$. This estimate
is essentially tight, because the sample size

$$n(\varepsilon_k, \delta) = O\left(f\left(\frac{1}{\varepsilon_k}\right)\log\left(\frac{1}{\delta}\right)\right) \qquad (4)$$

already suffices to train $\mathcal{N}$ to accuracy $4\varepsilon_k$ with confidence
$1 - \delta$.

Remark 3.5: It is easy to construct concept classes which
are PAC learnable under every input distribution, and yet
exhibit all possible rates of learning sample complexity.
These are the classes $\mathscr{C}$ which, speaking informally, cannot
tell the difference between a given probability distribution $\mu$
and some purely atomic measure $\nu$. More precisely, if the
sigma-algebra of sets generated by $\mathscr{C}$ is purely atomic and
$\mathscr{C}$ shatters every finite subset of an infinite set, then $\mathscr{C}$ will
have the above property.

An example is a class $\mathscr{C}$ that consists of all finite unions
of middle thirds of the Cantor set $C$. The atoms of the
sigma-algebra of sets generated by this class are precisely
the middle thirds, and so $\mathscr{C}$ has the desired property.

IV. Conclusion 

Stimulated by a question embedded in Problem
12.6 of Vidyasagar [11], we have shown that all rates of
sample complexity growth are possible for distribution-dependent
learning; in particular, all are realized by the binary-output
feed-forward sigmoidal neural network of Sontag.
Now Vidyasagar continues thus:

"I would like to have an "intrinsic" explanation as to
why in distribution-free learning, every learnable concept
class is also forced to be polynomially learnable. Next, how
far can one "push" this line of argument? Suppose $\mathcal{P}$ is
a family of probabilities that contains a ball in the total
variation metric $\rho$. From Theorem 8.8 it follows that every
concept class that is learnable with respect to $\mathcal{P}$ must also
be polynomially learnable (because $\mathscr{C}$ must have finite
VC-dimension). Is it possible to identify other such classes of
probabilities?"

We suggest the following conjecture, which, in our view, 
is the right framework in which to address Vidyasagar's 
question. 

Conjecture ("the sample complexity alternative"). Let $\mathcal{P}$
be a family of probability distributions on the domain $\Omega$.
Then either every class learnable under $\mathcal{P}$ is learnable with
sample complexity $O(\varepsilon^{-1})$, or else there exist PAC learnable
classes under $\mathcal{P}$ whose required sample complexity grows
arbitrarily fast.

The classical VC theory tells us that the conjecture is true
if $\mathcal{P}$ is the family of all probability measures: namely, the
first alternative always holds. In view of Corollary 3.3 the
conjecture is also true in the other extreme case, where
$\mathcal{P} = \{\mu\}$ contains a single distribution: unless $\mu$ is finitely
supported, we have the second alternative.

Problem 1. Does the above alternative hold for every
family $\mathcal{P}$ of probability distributions on the inputs?

Problem 2. Does there exist a non-atomic probability measure
on $\mathbb{R}$ under which the Sontag ANN is PAC learnable?



Problem 3. Give a criterion for a concept class to be PAC 
learnable under a fixed probability distribution in terms of 
shattering. 

Some sufficient conditions can be found in Q, Q], but 
none of them is also necessary. The "right" condition will 
be strictly intermediate between the witness of irregularity 
[9], [10] and the VC dimension modulo countable sets [7].